Red Wine Quality Analysis by Ravi Verma

Structure of data.

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The structure above shows 1599 observations of 12 variables. The raw file has 13 variables because it also carries an id column, ‘X’, as its first variable. The last variable, ‘quality’, represents the quality of the wine. The rest of the variables represent the characteristics that determine that quality.
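One way a 13-variable raw file ends up as the 12-variable structure printed above is by dropping the id column after loading. The sketch below is only an illustration of that step: the three-row excerpt, the id values in X, and the reduced set of columns are assumptions made to keep the example self-contained, and it is not necessarily how the data was prepared for this report.

```r
# Hypothetical three-row excerpt of the raw file. The id column X and the
# reduced column set are assumptions; the measurements are the first values
# printed by str() above.
raw_csv <- "X,fixed.acidity,volatile.acidity,alcohol,quality
1,7.4,0.70,9.4,5
2,7.8,0.88,9.8,5
3,7.8,0.76,9.8,5"

red_wine_raw <- read.csv(text = raw_csv)

# Drop the id column so that only the measurements and quality remain.
red_wine <- subset(red_wine_raw, select = -X)

str(red_wine)

# Exactly one column (X) was removed.
ncol(red_wine_raw) - ncol(red_wine)
```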

Univariate Plots Section

Summary of dataset.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

From the plot, we can see that most of the wines had quality ratings between 5 and 7.

Fixed acidity is almost normally distributed, but it is slightly skewed to the right. It has some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity is almost normally distributed. It also has some outliers.

Citric acid is not normally distributed; the graph is almost rectangular. It also has some outliers.

Residual sugar is positively skewed and has some extreme outliers.

Chlorides is positively skewed, the same as residual sugar. It has some extreme outliers.

Free sulphur dioxide is positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Total sulphur dioxide is positively skewed with some extreme outliers.
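The outlier remarks above can be checked against the summary table with the usual 1.5 x IQR boxplot rule. This is a sketch that applies the rule to the quartiles printed in the summary; the rule itself is my choice for illustration and is not necessarily what the plots used.

```r
# Min, quartiles and max copied from the summary output above.
fences <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "total.sulfur.dioxide"),
  min = c(4.6, 0.9, 6),
  q1  = c(7.1, 1.9, 22),
  q3  = c(9.2, 2.6, 62),
  max = c(15.9, 15.5, 289)
)

# Tukey fences: values beyond 1.5 * IQR from the quartiles count as outliers.
iqr <- fences$q3 - fences$q1
fences$lower_fence <- fences$q1 - 1.5 * iqr
fences$upper_fence <- fences$q3 + 1.5 * iqr

# A maximum above the upper fence means at least one high outlier exists;
# a minimum below the lower fence means at least one low outlier exists.
fences$has_high_outliers <- fences$max > fences$upper_fence
fences$has_low_outliers  <- fences$min < fences$lower_fence

print(fences)
```

For all three variables the maximum lies above the upper fence and the minimum lies inside the lower fence, which agrees with the right-skewed shapes described above.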

Density is approximately normally distributed.

pH is approximately normally distributed.

Sulphates has a right-skewed distribution.

Alcohol is also skewed to the right.

Log10-transformed plot for a better view of the distribution.

Univariate Analysis

What is the structure of your dataset?

The dataset has 1599 entries and 13 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). Most of the entries have quality between 5 and 7, and 5 and 6 are the most common ratings. No wine received a rating of 0, 1, 2, 9, or 10 (the observed range is 3 to 8), so the sample contains no wines at either extreme of the scale.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. I’d like to explore the impact of the other features on the quality of red wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think pH, acidity, sulphur, sulphates, and residual sugar are likely to determine the quality of red wine.

Did you create any new variables from existing variables in the dataset?

I created a new variable called quality.category, which is a factor variable created from the numerical value of quality. quality.category has three levels: Bad, Average, and Good. Bad: quality rating <= 4; Average: rating of 5 or 6; Good: rating >= 7.
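A minimal sketch of how such a factor can be built with `cut()`, using the thresholds stated above. The input vector is an assumption for illustration: it simply spans the observed rating range (3 to 8) so that every level appears.

```r
# Illustrative ratings spanning the observed range (min 3, max 8).
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf).
quality.category <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("Bad", "Average", "Good")
)

data.frame(quality, quality.category)
```

Because the intervals are right-closed, a rating of 4 falls in Bad and a rating of 6 falls in Average, matching the definition above.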

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

sulphates, free.sulfur.dioxide, total.sulfur.dioxide, and residual.sugar had skewed distributions. I log10-transformed them to bring them closer to a normal distribution and to get a better view of them.
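To illustrate why a log10 transform helps with right-skewed variables, the sketch below compares a simple moment-based skewness before and after the transform. It uses only the first ten total.sulfur.dioxide values printed by str(), so the numbers are an illustration on a tiny sample and not results for the full dataset; the `skewness` helper is my own definition, since base R does not provide one.

```r
# First ten total.sulfur.dioxide values, as printed by str() above.
tsd <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Moment-based skewness: third central moment over the 1.5 power of the
# second central moment. Positive values indicate a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(tsd)
skew_log <- skewness(log10(tsd))

c(raw = skew_raw, log10 = skew_log)
```

On this small sample the raw values are positively skewed and the log10 values are closer to symmetric, which is the effect the transformation is meant to achieve.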

Bivariate Plots Section

## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192
## 
##  Pearson's product-moment correlation
## 
## data:  red_wine$quality and red_wine$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

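The following variables had negative correlations with quality: total.sulfur.dioxide, free.sulfur.dioxide, volatile.acidity, pH, density, and chlorides.

The following had positive correlations with quality: fixed.acidity, sulphates, residual.sugar, and alcohol. The correlation for residual.sugar is not significantly different from zero (p-value = 0.5832).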
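The t statistics in the test output above follow directly from the correlation estimate and the degrees of freedom, t = r * sqrt(df / (1 - r^2)). The sketch below reproduces three of the reported values from the printed estimates; the helper name `t_from_r` is my own and is used only for illustration.

```r
# t statistic of a Pearson correlation r with dof = n - 2 degrees of freedom.
t_from_r <- function(r, dof) r * sqrt(dof / (1 - r^2))

dof <- 1597  # 1599 observations minus 2

t_alcohol  <- t_from_r(0.4761663, dof)   # reported above: t = 21.639
t_volatile <- t_from_r(-0.3905578, dof)  # reported above: t = -16.954
t_sugar    <- t_from_r(0.01373164, dof)  # reported above: t = 0.5488

round(c(alcohol = t_alcohol,
        volatile.acidity = t_volatile,
        residual.sugar = t_sugar), 3)

# Two-sided p-value from the t distribution (reported above: 0.5832).
p_sugar <- 2 * pt(-abs(t_sugar), dof)
round(p_sugar, 4)
```

With 1597 degrees of freedom even a correlation as small as -0.05 reaches p < 0.05, so the size of the estimate is more informative here than the p-value alone.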

With a high value of sulphur dioxide, wine quality is average; with a low value, it is bad.

Volatile acidity has a negative effect on quality.

pH has little impact on the quality rating. Wines with low pH have slightly better ratings.

Low-density wines seem to have good ratings.

Wines with high chloride values have bad ratings, and wines with low chloride values have better ratings.

Fixed acidity has almost no effect on quality.

Sulphates have a positive effect on quality.

Residual sugar has no effect on quality.

Citric acid has a positive effect on quality.

Alcohol has a positive effect on quality.

Getting the correlations between all the variables.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

From the table above we can see that

  1. pH and density have a strong relation with fixed.acidity (correlations of -0.68 and 0.67).

  2. alcohol has a clear negative relation with density (correlation of -0.50).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

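Quality has a moderate correlation with alcohol (0.48) and volatile acidity (-0.39), and a weaker one with sulphates (0.25). As alcohol increases, the quality of wine also increases. An increase in sulphates tends to increase wine quality, while an increase in volatile acidity tends to lower it.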

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I found that alcohol and density of the wine are negatively correlated. Volatile acidity has a positive correlation with pH. Density and fixed acidity are positively correlated.

What was the strongest relationship you found?

The strongest relationship I found was between pH and fixed acidity (correlation of -0.68).
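The strongest pair can be picked out of a correlation matrix programmatically instead of by eye. The sketch below does this on a five-variable excerpt of the matrix printed in the bivariate section; the choice of these five variables is mine, made to keep the example short, and the values are copied from that table.

```r
vars <- c("fixed.acidity", "density", "pH", "alcohol", "quality")

# Excerpt of the correlation matrix printed in the bivariate section.
m <- matrix(c(
   1.00000000,  0.66804729, -0.68297819, -0.06166827,  0.12405165,
   0.66804729,  1.00000000, -0.34169933, -0.49617977, -0.17491923,
  -0.68297819, -0.34169933,  1.00000000,  0.20563251, -0.05773139,
  -0.06166827, -0.49617977,  0.20563251,  1.00000000,  0.47616632,
   0.12405165, -0.17491923, -0.05773139,  0.47616632,  1.00000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Keep only the upper triangle so each pair is counted once and the
# diagonal of ones is ignored.
pairs_only <- m
pairs_only[!upper.tri(pairs_only)] <- NA

# Locate the entry with the largest absolute correlation.
idx <- which(abs(pairs_only) == max(abs(pairs_only), na.rm = TRUE),
             arr.ind = TRUE)

strongest   <- c(rownames(m)[idx[1, "row"]], colnames(m)[idx[1, "col"]])
strongest_r <- m[idx[1, "row"], idx[1, "col"]]

strongest
strongest_r
```

On the full data frame the same steps would start from `cor()` applied to the numeric columns instead of a hand-typed matrix.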

Multivariate Plots Section

Wine quality is good when alcohol and pH level are high.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High alcohol and sulphate values produce better quality wines. With alcohol held constant, red wines with a higher amount of volatile acidity tend to have worse quality than those with a lower amount of volatile acidity.

Were there any interesting or surprising interactions between features?

I found the interaction between citric acid and quality very interesting. The quality rating increases as the value of citric acid increases.


Final Plots and Summary

Plot One

Description One

Volatile acidity has a negative impact on the quality of red wine. The median volatile acidity of high-quality red wine is much lower than that of low-quality red wine. Volatile acidity is an important feature in determining the quality of red wine.
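A sketch of how the per-category medians behind such a comparison can be computed with `tapply()`. It uses only the first ten rows printed by str(), so it is an illustration of the technique and not the result for all 1599 wines; those ten rows happen to contain no Bad wines, so that level comes out as NA.

```r
# First ten rows, as printed by str() at the top of the report.
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Same three levels as quality.category: Bad <= 4, Average 5-6, Good >= 7.
quality.category <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("Bad", "Average", "Good")
)

# Median volatile acidity within each quality category.
medians <- tapply(volatile.acidity, quality.category, median)
medians
```

Even in this small excerpt the Good wines have a lower median volatile acidity than the Average wines, in line with the description above.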

Plot Two

Description Two

Good quality wines have higher value for alcohol and sulphates. High alcohol contents and high sulphate concentrations together seem to produce better wines.

Plot Three

Description Three

Red wine of higher quality tends to contain more alcohol and have a lower density overall.


Reflection

The dataset had 1599 entries and 13 variables. Most of the entries had a quality rating between 5 and 7, i.e. mostly average. I first plotted histograms of the individual variables to understand the data, and then I produced the correlation matrix of all the variables to understand the relations between them.

For bivariate analysis, I plotted boxplots of different variables against quality to see the relationships between them. After that I plotted different sets of variables to explore further bivariate relationships.

For multivariate analysis, I plotted different combinations of variables to find out which variables together affect the quality of red wine.

For future analysis, I would like to work on statistical models and learn more techniques to plot better graphs.